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Abstract 

Both common variants and rare variants are involved in the etiology of most complex diseases in humans. 
Developments in sequencing technology have led to the identification of a high density of rare variant single- 
nucleotide polymorphisms (SNPs) on the genome, each of which affects only at most 1% of the population. 
Genotypes derived from these SNPs allow one to study the involvement of rare variants in common human 
disorders. Here, we propose an association screening approach that treats genes as units of analysis. SNPs within a 
gene are used to create partitions of individuals, and inverse-probability weighting is used to overweight genotypic 
differences observed on rare variants. Association between a phenotype trait and the constructed partition is then 
evaluated. We consider three association tests (one-way ANOVA, chi-square test, and the partition retention 
method) and compare these strategies using the simulated data from the Genetic Analysis Workshop 17. Several 
genes that contain causal SNPs were identified by the proposed method as top genes. 



Background 

Rare variants are common on the genome and have long 
been speculated to be involved in the etiology of most 
human disorders [1]. In the 2000s, a large number of 
genome-wide association studies (GWAS) were con- 
ducted using relatively more common single-nucleotide 
polymorphisms (SNPs) (with minor allele frequency 
[MAF] > 5%). Most of the common variants identified 
in these studies have borderline odds ratios and can 
explain only a small fraction of susceptibility to a disease 
[2]. As a result, there has been increasing interest in the 
study of rare variants for complex diseases. This concern 
has also been fueled by advancements in sequencing 
technology. In particular, the availability of such tech- 
nology has directly led to the implementation of the 
1000 Genomes Project (http://www.1000genomes.org/), 
in which 1,000 genomes from individuals of different 
ethnic backgrounds were sequenced, consequently lead- 
ing to the identification of a large number of rare 
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variants (SNPs) with MAF < 1% and some very rare var- 
iants with MAF < 0.5%. Because of these low MAFs, 
association methods developed for common variants 
have limited efficiency for mapping rare variants in 
population studies. For these methods to have adequate 
power to detect individual rare variants, the sample size 
needs to increase substantially as the MAF decreases. 

It is also more likely for a rare variant to contribute to 
the susceptibility of a disease as part of a group of rare 
variants in the same gene or pathway. Therefore group- 
ing or collapsing rare variants is the most feasible option 
to improve efficiency in studying rare variants. Usually, 
the grouping is constructed on the basis of functional 
relevancy, physical proximity, or both. Once rare var- 
iants have been grouped, their genotypic information is 
combined, or collapsed, into a usually univariate score, 
and the association between the group of rare variants 
and the disease is then studied using the association 
between the univariate score and the disease traits. See 
Asimit and Zeggini [2] and Dering et al. [3] for excellent 
reviews of different methods for rare variant association 
analysis, including single-marker, multimarker, and 
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various collapsing strategies. A popular alternative to 
collapsing genotypic information is to combine single- 
SNP statistics. 

In this paper, we consider a gene-based association 
analysis for rare variants. This is equivalent to grouping 
based on the gene affiliation of SNPs. We propose using 
a clustering-based method for collapsing genotypic 
information of multiple SNPs within each gene. The 
clustering is based on an inverse-probability weighted 
sum of genotypic differences that highlights the varia- 
tion at rare variant loci. Association between the col- 
lapsed partition label and the disease traits can then be 
readily evaluated using single-marker association meth- 
ods, such as one-way analysis of variance (ANOVA), a 
chi-square test, and the partition retention method [4,5]. 
We apply our approach to the simulated data of the 
Genetic Analysis Workshop 17 (GAW17) without 
knowledge of the simulation models. After the work- 
shop, a comparison of our results with the simulation 
answers led to interesting observations regarding both 
the method and the simulated data. We discuss these 
observations in the Results section. 

Methods 

Data set 

The simulated data set of GAW17 is a combination of 
real sequence data and simulated phenotypes. An exome 
of 3,205 autosomal genes, corresponding to 24,487 
SNPs, was selected. Sequences of these SNPs were 
obtained from the 1000 Genomes Project on 697 unre- 
lated subjects. SNPs with missing values were imputed 
using fastPhase. A majority of the SNPs (74%) were rare 
variants (MAF < 1%). Two hundred phenotype sets were 
simulated based on these common genotype data. Each 
simulated unrelated-individual data set has three quanti- 
tative trait values (Ql, Q2, Q4) and the Affected status 
Y, with 209 case subjects and 488 control subjects. Gene 
information and SNP information were provided. Espe- 
cially, whenever available, SNPs were labeled as synon- 
ymous or nonsynonymous [6]. 

Gene-based grouping and collapsing of SNP genotypes 

We propose to evaluate an individual gene's association 
with disease traits. SNPs within a gene are grouped for 
the association analysis. Our main focus is a collapsing 
strategy for multiple-SNP genotypes within a gene. We 
propose to create partitions of individuals (or observed 
genotypes) based on their genotypic differences evalu- 
ated by inverse-probability weighted similarity scores. It 
is easier to start with considering alleles at a single SNP 
locus first. For two individuals, we can count when they 
have the same alleles or different alleles. When the 
MAF is small, the chance of having a random match for 
the major allele is high. On the other hand, if a rare 



variant is involved in the etiology of a disease, then the 
case subjects are more likely to have the same rare var- 
iants than the control subjects are. Therefore for rare 
variant association analysis we want to overweight the 
allelic or genotypic similarity for the minor alleles but 
not that for the major alleles. 

We use the inverse-probability weighted similarity 
score, as defined in Table 1. This score has a mean 
similarity 0, which is also a desirable property. The alle- 
lic similarity can be straightforwardly generalized to the 
genotypic similarity scores in Table 2. For example, an 
individual 1 with genotype aa and an individual 2 with 
genotype Aa will have one match (a, a) and one mis- 
match (a, A). Because a is the minor allele, the (a, a) 
match will dominate the (a, A) mismatch, and these two 
individuals will have a high similarity score. Such a 
weighting scheme implicitly assumes that individuals 
with the same rare variants will be clustered together 
for association analysis with the disease outcomes. 

We denote the genotypic similarity score between two 
individuals i and j at SNP k by sim(/, /<). For a given 
gene G, the similarity between i and ; is defined as the 
sum of the similarity scores on SNPs within the gene: 

sim(i, j) = sim(i, j; fe). ^) 

keG 

For the 697 individuals, pairwise similarity scores, the 
sim(/, ;), are evaluated first and are then converted to a 
distance measure using the transformation: 

d(i, j) = exp [-a sim(i, j) ] , (2) 

where a is a normalizing constant such that the dis- 
tance calculated at each gene is bounded by e 20 . We 
then apply hierarchical clustering using Ward's method 
[7] and partition individuals into groups by cutting the 
hierarchical clustering tree into a prespecified number 
of groups (we consider partition sizes of 5 to 10). See 
Figure 1 for an example using FLT1. We also take 
advantage of the synonymy information about the SNPs 
by carrying out two separate analyses using nonsynon- 
ymous SNPs only or every SNP in a gene. 

Table 1 Inverse probability similarity measure: allelic 
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Table 2 Inverse probability similarity measure: genotypic similarity scores 
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p a is the population frequency of minor allele a. 



Partition-based association analysis 

After obtaining the partition of individuals, for each 
gene we tested the association between the partition 
indexes obtained from the SNPs in that particular gene 
and the disease phenotypes. For the disease status Y, we 
considered one-way ANOVA, the chi-square test of 
independence, and the partition retention method [4]. 
For continuous-valued disease outcomes Ql, Q2, and 
Q4, we considered one-way ANOVA and the partition 
retention method. 

The partition retention method is based on association 
measure / defined between an outcome variable Y and a 
partition IT. More specifically, 



where n t is-the number of individuals in partitioji ele- 
ment i and 1 is the sample mean of element /. ^ and 
s are the sample mean and the standard deviation of all 
n individuals, respectively. Under the null hypothesis, / 
asymptotically converges to a weighted sum of chi- 
square distributions with 1 degree of freedom and there- 
fore has mean 1. The partition retention method is 
more robust to sparse partition than the chi-square test 
and can be applied to both dichotomous disease status 



Case/Control Labels 
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maf= 7e-04 
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maf= 7e-04 
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In this example, the 
hierarchical tree was 
"cut to form a partition 
of 5 elements. 



1 J- J 



partition element 2 



Figure 1 Clustering of individuals using nonsynonymous SNPs for FLT1. Each row is a SNP, and each column is an individual. Green vertical 
bars indicate case subjects. Genotype aA is plotted in blue, and genotype AA is plotted in white {a is the minor allele); the genotype oo was not 
observed. The partitions of the 697 individuals are indicated by dotted lines. Partition element 2 is driven by similarity on SNP C13S431 but not 
on the more common SNPs C13S522 and C13S523. 
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and continuous-valued traits [4]. Intuitively, the / in the 
partition retention method evaluates the amount of 
influence a particular gene has on the disease 
phenotypes. 

^-values for the ANOVA test and the chi-square test 
are derived from corresponding asymptotic distributions. 
To address the multiple testing issue, we control the 
family-wise error rate using the conservative Bonferroni 
correction. For the evaluation using the partition reten- 
tion 7, we simply chose the top 0.1% of genes for each 
trait. A further examination of results from chromosome 
4 revealed that, by using a cutoff of the top 0.1%, only 
15 of the 200 replicates returned any null gene (a 
family- wise type I error rate), which suggests that the 
top 0.1% is a reasonable threshold. In practice, we sug- 
gest evaluating ^-values using permutations and control- 
ling the false discovery rate in order to have better 
sensitivity to real genetic signals. 

Results 

Because we have 200 simulation sets, for each gene we 
counted the number of times it was selected (either in 
the top 0.1% for I using the partition retention method 
or significant by Bonferroni correction for ANOVA and 
the chi-square test) for each trait for each method. We 
also compared the effects of partition sizes (results not 



shown). The significance varied between different parti- 
tion sizes, and the partition size that corresponded to 
the most significant results also changed from simula- 
tion to simulation. Therefore we used the average count 
across six partition sizes (from 5 to 10) to rank genes. 
By visually examining the average counts (not shown), 
we observed that Ql had strong genetic signals and that 
Q2 and Affected status were harder to map. For Q4, the 
one-way ANOVA identified many noncausal genes, or 
false positives, to which the partition retention method 
was relatively more immune. 

Figure 2 summarizes the results from the 200 simula- 
tions. The top 10 genes for each method and each trait 
are plotted by chromosome. Note that for Q2 the top 
10 genes are identified less than 25% of the time and 
that the six genes that contain "answers" or causal genes 
are identified as top genes but with less than 5% prob- 
ability, with the exception of VNN1, which is identified 
by the partition retention method 22% of the time. Two 
genes for Ql (FLT1 and KDR) are identified in more 
than 50% of the simulated replicates. It is interesting to 
note that excluding synonymous SNPs led to better 
identification of FLT1 and had less effect on identifica- 
tion of KDR. 

To better understand the "consistent false positives" 
problem that arose during GAW17, we studied several 
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Figure 2 Top ten genes identified by each of the methods and for each of Y, Ql, 

chromosome. Genes with causal SNPs are highlighted (yellow for Q1 and blue for Q2). 



Q2, and Q4. Ninety-one genes are shown, displayed by 
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Table 3 Association between a consistent false-positive gene (OR2T3) and a causal SNP at C13S523 



C13S523 genotype (p = 1.8 x 10 18 by Fisher's exact test) 




Partition based on SNPs of OR2T3 
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consistent false-positive genes identified by our methods. 
All of them were found to be significantly associated 
with multiple causal SNPs. See Table 3 for an example 
between the gene OR2T3 on chromosome 1 and a cau- 
sal SNP at C13S523. 

We further investigated the relation between power to 
detect (probability of true positive) and the effect size of 
a gene. The effect size for each SNP is provided by 
Almasy et al. [6]. For each gene, we define its total effect 
size as: 



effect ^ = MAF,-/?,-, 



(4) 



SNP iGc 



where /J/ is the effect size P used in the simulation 
model for SNP i, which is 0 for noncausal SNPs. 

Figure 3 plots the frequencies of each gene with causal 
genes identified by the best performing method for each 
trait against the gene-wise effect size, that is, the one- 
way ANOVA with Bonferroni correction for Ql and Y 
and the / from the partition retention method for Q2. 
The power of our approach suffers greatly for extreme 
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Figure 3 Power to identify a causal gene versus effect size. For 

each trait, we plot the power to detect using the best performing 
method against the effect size used in the simulation model. That 
is, we plot the one-way ANOVA with Bonferroni correction for Q1 
and Y, and the / from the partition retention method for Q2. The 
gene-wise effect size is defined as the sum of SNP-wise MAF x 
causal SNP effect in the simulation model. 



rare variants if the effect size does not scale up as MAF 
drops. 

Discussion and conclusions 

In this paper, we propose a novel strategy for gene- 
based association analysis for genes with multiple poten- 
tially rare variants. The inverse-probability weighted 
clustering approach automatically adjusts weights for 
rare variants and overweights their genotypic variation 
when comparing individuals for an association study. 
Individuals are first partitioned on the basis of their 
genetic similarity on multiple SNPs in a gene, and this 
partition is then used to calculate association between a 
gene and a disease trait. 

We also considered several association scores and the 
effect of including synonymous variants. Different meth- 
ods seem to focus on nonoverlapping signals, which 
suggests a multimethod approach for future association 
studies. From our results, we can conclude that our 
method gains power by considering multiple rare var- 
iants in a gene, as illustrated in Figure 1 for one of our 
identified causal genes. It is probably beneficial to con- 
sider synonymous and nonsynonymous SNPs in future 
practice. Filtering out synonymous SNPs corresponds to 
a weight of 0 being assigned to synonymous SNPs and a 
weight of 1 being assigned to nonsynonymous SNPs, 
which can be extended to a smoother weighting scheme 
as a possible future direction. 

For this simulation study, we used asymptotic ^-values 
and the conservative Bonferonni correction because we 
needed to analyze 200 sets of data. In practice, we sug- 
gest evaluating ^-values using permutations and control- 
ling the false discovery rate in order to have better 
sensitivity to real genetic signals. Population information 
is provided with the simulated data. Some consistent 
false positives may have resulted from confounding due 
to population admixture. We recommend using existing 
methods, such as Eigensoft [8], to adjust for population 
stratification in real applications when applying our 
method. It should be pointed out that algorithms such 
as Eigensoft [8] may convert the original discrete geno- 
type data to continuous values, which requires modifica- 
tion to the similarity measure defined in Table 1. 
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